{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# Hyperparameter Tuning using Your Own Keras/Tensorflow Container\n",
    "\n",
    "This notebook shows how to build your own Keras(Tensorflow) container, test it locally using SageMaker Python SDK local mode, and bring it to SageMaker for training, leveraging hyperparameter tuning. \n",
    "\n",
    "The model used for this notebook is a ResNet model, trainer with the CIFAR-10 dataset. The example is based on https://github.com/keras-team/keras/blob/master/examples/cifar10_cnn.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set up the notebook instance to support local mode\n",
    "Currently you need to install docker-compose in order to use local mode (i.e., testing the container in the notebook instance without pushing it to ECR)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!/bin/bash setup.sh"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Permissions\n",
    "\n",
    "Running this notebook requires permissions in addition to the normal `SageMakerFullAccess` permissions. This is because it creates new repositories in Amazon ECR. The easiest way to add these permissions is simply to add the managed policy `AmazonEC2ContainerRegistryFullAccess` to the role that you used to start your notebook instance. There's no need to restart your notebook instance when you do this, the new permissions will be available immediately."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set up the environment\n",
    "We will set up a few things before starting the workflow. \n",
    "\n",
    "1. get the execution role which will be passed to sagemaker for accessing your resources such as s3 bucket\n",
    "2. specify the s3 bucket and prefix where training data set and model artifacts are stored"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import numpy as np\n",
    "import tempfile\n",
    "\n",
    "import tensorflow as tf\n",
    "\n",
    "import sagemaker\n",
    "import boto3\n",
    "from sagemaker.estimator import Estimator\n",
    "\n",
    "region = boto3.Session().region_name\n",
    "\n",
    "sagemaker_session = sagemaker.Session()\n",
    "smclient = boto3.client('sagemaker')\n",
    "\n",
    "bucket = sagemaker.Session().default_bucket()  # s3 bucket name, must be in the same region as the one specified above\n",
    "prefix = 'sagemaker/DEMO-hpo-keras-cifar10'\n",
    "\n",
    "role = sagemaker.get_execution_role()\n",
    "\n",
    "NUM_CLASSES = 10   # the data set has 10 categories of images"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Complete source code\n",
    "- [trainer/start.py](trainer/start.py): Keras model\n",
    "- [trainer/environment.py](trainer/environment.py): Contain information about the SageMaker environment"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Building the image\n",
    "We will build the docker image using the Tensorflow versions on dockerhub. The full list of Tensorflow versions can be found at https://hub.docker.com/r/tensorflow/tensorflow/tags/\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import shlex\n",
    "import subprocess\n",
    "\n",
    "def get_image_name(ecr_repository, tensorflow_version_tag):\n",
    "    return '%s:tensorflow-%s' % (ecr_repository, tensorflow_version_tag)\n",
    "\n",
    "def build_image(name, version):\n",
    "    cmd = 'docker build -t %s --build-arg VERSION=%s -f Dockerfile .' % (name, version)\n",
    "    subprocess.check_call(shlex.split(cmd))\n",
    "\n",
    "#version tag can be found at https://hub.docker.com/r/tensorflow/tensorflow/tags/ \n",
    "#e.g., latest cpu version is 'latest', while latest gpu version is 'latest-gpu'\n",
    "tensorflow_version_tag = '1.10.1'   \n",
    "\n",
    "account = boto3.client('sts').get_caller_identity()['Account']\n",
    "    \n",
    "ecr_repository=\"%s.dkr.ecr.%s.amazonaws.com/test\" %(account,region) # your ECR repository, which you should have been created before running the notebook\n",
    "\n",
    "image_name = get_image_name(ecr_repository, tensorflow_version_tag)\n",
    "\n",
    "print('building image:'+image_name)\n",
    "build_image(image_name, tensorflow_version_tag)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prepare the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def upload_channel(channel_name, x, y):\n",
    "    y = tf.keras.utils.to_categorical(y, NUM_CLASSES)\n",
    "\n",
    "    file_path = tempfile.mkdtemp()\n",
    "    np.savez_compressed(os.path.join(file_path, 'cifar-10-npz-compressed.npz'), x=x, y=y)\n",
    "\n",
    "    return sagemaker_session.upload_data(path=file_path, bucket=bucket, key_prefix='data/DEMO-keras-cifar10/%s' % channel_name)\n",
    "\n",
    "\n",
    "def upload_training_data():\n",
    "    # The data, split between train and test sets:\n",
    "    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()\n",
    "\n",
    "    train_data_location = upload_channel('train', x_train, y_train)\n",
    "    test_data_location = upload_channel('test', x_test, y_test)\n",
    "\n",
    "    return {'train': train_data_location, 'test': test_data_location}\n",
    "\n",
    "channels = upload_training_data()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Testing the container locally (optional)\n",
    "\n",
    "You can test the container locally using local mode of SageMaker Python SDK. A training container will be created in the notebook instance based on the docker image you built. Note that we have not pushed the docker image to ECR yet since we are only running local mode here. You can skip to the tuning step if you want but testing the container locally can help you find issues quickly before kicking off the tuning job."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Setting the hyperparameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "hyperparameters = dict(batch_size=32, data_augmentation=True, learning_rate=.0001, \n",
    "                       width_shift_range=.1, height_shift_range=.1, epochs=1)\n",
    "hyperparameters"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create a training job using local mode"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "output_location = \"s3://{}/{}/output\".format(bucket,prefix)\n",
    "\n",
    "estimator = Estimator(image_name, role=role, output_path=output_location,\n",
    "                      train_instance_count=1, \n",
    "                      train_instance_type='local', hyperparameters=hyperparameters)\n",
    "estimator.fit(channels)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pushing the container to ECR\n",
    "Now that we've tested the container locally and it works fine, we can move on to run the hyperparmeter tuning. Before kicking off the tuning job, you need to push the docker image to ECR first. \n",
    "\n",
    "The cell below will create the ECR repository, if it does not exist yet, and push the image to ECR."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The name of our algorithm\n",
    "algorithm_name = 'test'\n",
    "\n",
    "# If the repository doesn't exist in ECR, create it.\n",
    "exist_repo = !aws ecr describe-repositories --repository-names {algorithm_name} > /dev/null 2>&1\n",
    "\n",
    "if not exist_repo:\n",
    "    !aws ecr create-repository --repository-name {algorithm_name} > /dev/null\n",
    "\n",
    "# Get the login command from ECR and execute it directly\n",
    "!$(aws ecr get-login --region {region} --no-include-email)\n",
    "\n",
    "!docker push {image_name}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Specify hyperparameter tuning job configuration\n",
    "*Note, with the default setting below, the hyperparameter tuning job can take 20~30 minutes to complete. You can customize the code in order to get better result, such as increasing the total number of training jobs, epochs, etc., with the understanding that the tuning time will be increased accordingly as well.*\n",
    "\n",
    "Now you configure the tuning job by defining a JSON object that you pass as the value of the TuningJobConfig parameter to the create_tuning_job call. In this JSON object, you specify:\n",
    "* The ranges of hyperparameters you want to tune\n",
    "* The limits of the resource the tuning job can consume \n",
    "* The objective metric for the tuning job"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "import json\n",
    "from time import gmtime, strftime\n",
    "\n",
    "tuning_job_name = 'BYO-keras-tuningjob-' + strftime(\"%d-%H-%M-%S\", gmtime())\n",
    "\n",
    "print(tuning_job_name)\n",
    "\n",
    "tuning_job_config = {\n",
    "    \"ParameterRanges\": {\n",
    "      \"CategoricalParameterRanges\": [],\n",
    "      \"ContinuousParameterRanges\": [\n",
    "        {\n",
    "          \"MaxValue\": \"0.001\",\n",
    "          \"MinValue\": \"0.0001\",\n",
    "          \"Name\": \"learning_rate\",          \n",
    "        }\n",
    "      ],\n",
    "      \"IntegerParameterRanges\": []\n",
    "    },\n",
    "    \"ResourceLimits\": {\n",
    "      \"MaxNumberOfTrainingJobs\": 9,\n",
    "      \"MaxParallelTrainingJobs\": 3\n",
    "    },\n",
    "    \"Strategy\": \"Bayesian\",\n",
    "    \"HyperParameterTuningJobObjective\": {\n",
    "      \"MetricName\": \"loss\",\n",
    "      \"Type\": \"Minimize\"\n",
    "    }\n",
    "  }\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Specify training job configuration\n",
    "Now you configure the training jobs the tuning job launches by defining a JSON object that you pass as the value of the TrainingJobDefinition parameter to the create_tuning_job call.\n",
    "In this JSON object, you specify:\n",
    "* Metrics that the training jobs emit\n",
    "* The container image for the algorithm to train\n",
    "* The input configuration for your training and test data\n",
    "* Configuration for the output of the algorithm\n",
    "* The values of any algorithm hyperparameters that are not tuned in the tuning job\n",
    "* The type of instance to use for the training jobs\n",
    "* The stopping condition for the training jobs\n",
    "\n",
    "This example defines one metric that Tensorflow container emits: loss. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "training_image = image_name\n",
    "\n",
    "print('training artifacts will be uploaded to: {}'.format(output_location))\n",
    "\n",
    "training_job_definition = {\n",
    "    \"AlgorithmSpecification\": {\n",
    "      \"MetricDefinitions\": [\n",
    "        {\n",
    "          \"Name\": \"loss\",\n",
    "          \"Regex\": \"loss: ([0-9\\\\.]+)\"\n",
    "        }\n",
    "      ],\n",
    "      \"TrainingImage\": training_image,\n",
    "      \"TrainingInputMode\": \"File\"\n",
    "    },\n",
    "    \"InputDataConfig\": [\n",
    "        {\n",
    "            \"ChannelName\": \"train\",\n",
    "            \"DataSource\": {\n",
    "                \"S3DataSource\": {\n",
    "                    \"S3DataType\": \"S3Prefix\",\n",
    "                    \"S3Uri\": channels['train'],\n",
    "                    \"S3DataDistributionType\": \"FullyReplicated\"\n",
    "                }\n",
    "            },\n",
    "            \"CompressionType\": \"None\",\n",
    "            \"RecordWrapperType\": \"None\"\n",
    "        },\n",
    "        {\n",
    "            \"ChannelName\": \"test\",\n",
    "            \"DataSource\": {\n",
    "                \"S3DataSource\": {\n",
    "                    \"S3DataType\": \"S3Prefix\",\n",
    "                    \"S3Uri\": channels['test'],\n",
    "                    \"S3DataDistributionType\": \"FullyReplicated\"\n",
    "                }\n",
    "            },            \n",
    "            \"CompressionType\": \"None\",\n",
    "            \"RecordWrapperType\": \"None\"            \n",
    "        }\n",
    "    ],\n",
    "    \"OutputDataConfig\": {\n",
    "      \"S3OutputPath\": \"s3://{}/{}/output\".format(bucket,prefix)\n",
    "    },\n",
    "    \"ResourceConfig\": {\n",
    "      \"InstanceCount\": 1,\n",
    "      \"InstanceType\": \"ml.m4.xlarge\",\n",
    "      \"VolumeSizeInGB\": 50\n",
    "    },\n",
    "    \"RoleArn\": role,\n",
    "    \"StaticHyperParameters\": {\n",
    "        \"batch_size\":\"32\",\n",
    "        \"data_augmentation\":\"True\",\n",
    "        \"height_shift_range\":\"0.1\",\n",
    "        \"width_shift_range\":\"0.1\",\n",
    "        \"epochs\":'1'\n",
    "    },\n",
    "    \"StoppingCondition\": {\n",
    "      \"MaxRuntimeInSeconds\": 43200\n",
    "    }\n",
    "}\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create and launch a hyperparameter tuning job\n",
    "Now you can launch a hyperparameter tuning job by calling create_tuning_job API. Pass the name and JSON objects you created in previous steps as the values of the parameters. After the tuning job is created, you should be able to describe the tuning job to see its progress in the next step, and you can go to SageMaker console->Jobs to check out the progress of each training job that has been created."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "smclient.create_hyper_parameter_tuning_job(HyperParameterTuningJobName = tuning_job_name,\n",
    "                                               HyperParameterTuningJobConfig = tuning_job_config,\n",
    "                                               TrainingJobDefinition = training_job_definition)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's just run a quick check of the hyperparameter tuning jobs status to make sure it started successfully and is `InProgress`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "smclient.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName = tuning_job_name)['HyperParameterTuningJobStatus']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Analyze tuning job results - after tuning job is completed\n",
    "Please refer to \"HPO_Analyze_TuningJob_Results.ipynb\" to see example code to analyze the tuning job results."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Deploy the best model\n",
    "Now that we have got the best model, we can deploy it to an endpoint. Please refer to other SageMaker sample notebooks or SageMaker documentation to see how to deploy a model."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_tensorflow_p36",
   "language": "python",
   "name": "conda_tensorflow_p36"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  },
  "notice": "Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.  Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
